Project: RNA-seq and ATAC-seq analysis of structural data

Author

Gianfranco Botta, Guido Putignano, Eric Reber

Published

January 12, 2024

Introduction

Our immune system, comprising both specific and unspecific cells, plays a crucial role in protecting organisms from pathogens. However, recent discoveries suggest that the scope of pathogen defense extends beyond the traditional immune system, with structural cells contributing significantly. This project aims to elucidate a novel aspect of pathogen defense mechanisms provided by structural cells. The initial section provides an extensive overview of our methodologies and findings, referencing pivotal studies that laid the groundwork for our research. We critically analyze the strengths of our approach, highlighting potential areas for further improvement and exploration. In the second section, we delve into an organ-based analysis utilizing cutting-edge techniques such as RNA sequencing (RNAseq) and Assay for Transposase-Accessible Chromatin sequencing (ATACseq). This detailed examination sheds light on the unique roles and regulatory mechanisms of structural cells in various organs. The third section is dedicated to comparing the effectiveness and insights gained from each analytical tool, specifically focusing on their utility in organ-based research. This comparative analysis enhances our understanding of the intricate interactions within structural cells across different organs. Our project’s goal transcends mere replication of existing analyses. By juxtaposing our findings with those of previous authors, we aim to validate and build upon their results. This approach not only corroborates the existing data but also enriches the interpretation, offering a more profound and nuanced understanding of structural cells’ role in pathogen defense.

Immune cells can be broadly classified into two categories: Innate Immunity and Adaptive Immunity. Each category comprises various cell types with distinct roles in the immune response. However, even though these cell types such as T-cells, B-cells, Dendritic cells and Granulocytes have a fundamental importance in the defence of the organism, recent research has found structural cells to play an imporant role in the response to infection and inflammation (Minton 2020) (Krausgruber et al. 2020) (Tuong and Clatworthy 2020) (Gomes and Teichmann 2020). Specifically, our cells of interest include epithelial cells, endothelial cells and fibroblasts.

Cell Type Location Function and Role in Pathogen Defense
Epithelial Cells Present on organ surfaces Provide more than just support; involved in responses to infection either directly or through interactions with immune cells.
Fibroblasts Part of the connective tissue Help maintain the extracellular matrix material surrounding cells.
Endothelial Cells Line the interior of vessels (e.g., blood vessels) Involved in responses to infection either directly or through interactions with immune cells, alongside epithelial cells.

The main area of research in this case, mainly focuses on describing how structural cells can have an impact on the protection of the organism through the interaction with cells from the immune system. The major takeaways include:

  1. Unrealized Potential in Promoter Accessibility: The study identifies open, accessible promoters with low levels of expression, termed as ‘unrealized potential.’ This suggests these genes are likely poised for a rapid response upon infection, indicating a pre-emptive genetic mechanism in anticipation of pathogenic challenges.

  2. Similarity Across Structural Cells within the Same Organ: A significant observation is that different structural cell types within the same organ exhibit more similarities with each other than the same cell type does across different organs. This finding underscores the influence of the organ-specific microenvironment on cellular behavior and genetic expression.

  3. Distinct Genetic Responses to LCMV and Cytokines: The research highlights that the genetic reaction of structural cells varies significantly between Lymphocytic Choriomeningitis Virus (LCMV) infection and cytokine stimulation. This differential response opens avenues for potential new combination therapies, tailored to specific pathogens and inflammatory stimuli.

Our analysis will mainly focus on the first two points, using RNA-seq and ATAC-seq pipelines as our main tool for investigation. We will use RNA-seq to quantify gene expression levels, and use ATAC-seq to assess the chromatin accessibility.

Data

The data

We are working with bulk RNA-seq data and bulk ATAC-seq data from the paper “Structural cells are key regulators of organ-specific immune responses” by Krausgruber et al. The paper focuses on identifying unrealized epigenetic potentials in structural cells across different organs in mice (mus musculus). We are working with a subset of the data, namely the structural cells of 8 organs; the brain, heart, lung, liver, spleen, kidney, large intestine and small intestine. There are 3 types of structural cells investigated in the paper which we will work with, namely endothelial, epithelial and fibroblast cells.

The RNA-seq data is obtained from 3 different replicates of C57BL/6J mice, whilst the ATAC-seq data was obtained from 2 different replicates. Both the RNA-seq data and the ATAC-seq data were obtained using Illumina single-end sequencing.

Preprocessing

We obtain our RNA-seq and ATAC-seq data from SRA by running the accession list through the fasterq-dump tool.

cat SRR_Acc_List.txt | xargs fasterq-dump --threads 8 --progress

We use FastQC in order to obtain our quality control results both for our RNA-seq data and our ATAC-seq data.

RNA-seq Quality Control

Inital RNA QC

For our RNA-seq data we obtain failing grades for per base sequence content and per sequence GC content, which are commonly ignored. Just like in ATAC-seq, we receive warnings for sequence duplication warnings, however, for RNA-seq data this is section can be ignored. FastQC will yield these errors for our data as it was originally intended to use whole genome data #fact check

As for the overrepresented sequences, a trimming is often applied if there are also many samples that fail in adapter content. Unlike our ATAC-seq quality control, this is not the case here. For the few samples for where there are failing grades, we will rely on using soft trimming during the mapping step using STAR. Thus, we continue with the data as-is into our mapping.

ATAC-seq Quality Control

Initial ATAC

We obtain warnings and failing grades for adapter content for our endothelial and epithelial data, along with failing grades for sequence duplication levels. In the quality control reports for adapter content there is a presence the Nextera transposase sequences, which are commonly used in ATAC-seq data obtained from Illumina. In order to remove the adapter content, we use fastp to trim the sequences for the adapter. As we want to trim the Nextera transposase sequences, we specify that we wish to trim CTGTCTCTTATACACATCT. In order to address the sequence duplication levels present in the different, we wish to perform a deduplication step. We use fastp for both our trimming and our deduplication.

fastp --dedup --adapter_sequence CTGTCTCTTATACACATCT -l 25 -i $file -o trimmed_dedup/$file

After performing the deduplication and trimming steps, we rerun FastQC and obtain our new quality control results.

Deduptrim Quality Control

We see that the deduplication and trimming steps fix our failing grades for adapter content, successfully removing the Nextera adapters. The sequence deduplication levels now also receive passing grades, indicating a successfull deduplication. We are still left with warnings for the sequence length distribution and per base sequence content, however these are commonly obtained warnings and are often ignored if the other sections are fine.

Mapping

The mapping is done with STAR for RNA-seq and with Bowtie2 for ATAC-seq. We will use the GRCm39 mouse genome from the UCSC Genome Browser as our reference genome, for which we will build an index.

For mapping ATAC-seq data with Bowtie2, we use the following settings in our bash script:

bowtie2 --very-sensitive \
        -p 16 \
        -X 2000 \
        -x genome/indexed_m39_genome \
        -U rawdata/${id}*.fastq.gz \
        -S genome/${id}.sam

We then use Samtools in order to convert the .sam files to .bam files.

for f in *.sam
do
    samtools view -Sb $f > $f.bam
done

For our RNA-seq data, we use the following settings for STAR:

STAR --genomeDir genome/star-index \
         --runThreadN 10 \
         --readFilesIn ${id}.fastq.gz \
         --readFilesCommand zcat \
         --sjdbGTFfile genome/gencode.vM33.annotation.gtf \
         --quantMode TranscriptomeSAM \
         --outSAMtype BAM SortedByCoordinate \
         --outFileNamePrefix mapping_transcriptome/$id/

After having completed mapping we will look at the quality of the mapping for our ATAC-seq and RNA-seq data.

Mapping QC

For our endothelial RNA data, we see a large variety in the amount of uniquely mapped reads, the amount of reads to multiple loci and unmapped reads. We have some outliers with very poor unique mapping percentages, with SRR11256465 having a below 10% mapping rate, and with SRR11256377 and SRR11256377 having a unique mapping percentage of below 20%.

Note that not all mapping percentages add up to 100%, as STAR also reports statistics such as “% of reads mapped to too many loci” and “% unmapped reads: other” which usually make up <2% of the mapping percentage which we have chosen not to include in our figure.

For the endothelial RNA data, the result is similar to our endothelium data in that the results are quite varied. We can again identify outliers; SRR11256459 and SRR11256560 are have unique mapping percentages around the 10% mark, with SRR11256442 and SRR11256458 having unique mapping percentages around 25%.

For the samples with unique mapping percentages around 10%; SRR11256459 and SRR11256460, the number of uniquely mapped reads is 1198017 and 1998939 respectively.

Again, for the fibroblast RNA data we observe that the mapping rates for uniquely mapped, unmapped and reads mapped to multiple loci is varying across samples. For fibroblast data, SRR11256383, SRR11256384 and SRR11256401 have ~10% uniquely mapping reads. We will continue with the samples that have a lower mapping rate in the differential expression analysis, however, we will keep an eye on these samples in case outliers in the differential expression analysis correlate to poorly mapped samples.

Before moving on to our RNA-seq analysis, we use RSEM in order to obtain normalized gene counts, which we will use as our input for the RNA-seq analysis.

rsem-calculate-expression --alignments \
                          -p 10 \
                          mapping_transcriptome/$id/Aligned.toTranscriptome.out.bam \
                          genome/rsem_grcm39_genecode33/rsem_grcm39_genecode33 \
                          rsem/$id/$id

RNA-seq analysis

We load in the RSEM data and retrieve additional metadata such as the ensembl_gene_id and mgi_symbol from Ensembl.

Exploratory analysis

dim(expr_endo)
[1] 56884    24
dim(expr_epi)
[1] 56884    24
dim(expr_fibro)
[1] 56884    24

We have 56884 annotated genes and 24 samples for each of our three types of structural cells.

To focus on the specific genes we’re interested in, we can examine the average expression level of these genes in each sample in our dataset.

Distributions of the average counts per sample:

grid.arrange(hist_endo, hist_epi, hist_fibro, ncol=3)

We look at in how many samples a gene was detected.

In order to get a better understanding of the distribution of our detected genes, we create a violin plot showing the number of genes detected across samples over the count for the different types of structural cells.

grid.arrange(viol_endo, viol_epi, viol_fibro, ncol=3)

From this plot we can see we have a high number of genes which are barely expressed or completely unexpressed in the three cell types. Hence, we select the expressed genes and mark them in the meta dataframe to trace them for our following analysis. Moreover, from the violin plot we can see the distribution of the expression of the three types of cells is distributed as a negative-binomial distribution.

We filter away the unexpressed genes based on a criteria rowMeans(expr_endo > 0) >= 0.5 | rowMeans(expr_endo) >= 1, the count of genes pertinent to our analysis has now reduced to 14478 for endothelial cells, 15461 for epithelial cells, and 15783 for fibroblasts. We have added a column to mark these relevant genes. Moving forward, our analysis will focus exclusively on this subset of genes.

expressed_endo <- rowMeans(expr_endo > 0) >= 0.5 | rowMeans(expr_endo) >= 1
expressed_epi <- rowMeans(expr_epi > 0) >= 0.5 | rowMeans(expr_epi) >= 1
expressed_fibro <- rowMeans(expr_fibro > 0) >= 0.5 | rowMeans(expr_fibro) >= 1
meta_genes$expressed_endo <- expressed_endo
meta_genes$expressed_epi <- expressed_epi
meta_genes$expressed_fibro <- expressed_fibro
[1] 14478
[1] 15461
[1] 15783

Comparing samples unbiasedly

To assess the correlation between samples whilst disregarding information about sample covariates, we will use Spearman’s rank correlation coefficient (SCC). This choice is due to our data not conforming to normal or log-normal distributions. Unlike Pearson’s correlation coefficient (PCC), SCC is non-parametric, making it more adaptable and robust to the nature of our data distribution.

corr_spearman_endo <- cor(expr_endo[meta_genes$expressed_endo,], method = "spearman")
corr_spearman_epi <- cor(expr_epi[meta_genes$expressed_epi,], method = "spearman")
corr_spearman_fibro <- cor(expr_fibro[meta_genes$expressed_fibro,], method = "spearman")

Therefore, we will generate two hierarchical clustering trees to analyze our samples. This approach will help us better understand the clustering patterns of our samples, specifically focusing on ‘Organ’ (our variable of interest) and ‘Replicate’ (our covariate).

Furthermore, we will also use the hierarchical clustering trees to observe the behaviour of our outlier samples, as we have marked the outliers from the mapping phase in red.

The samples display clear clustering by organ type, indicating that the organ from which they were derived has a significant impact on the results in all of the three cell types. In contrast, the distribution of replicate numbers across the clusters appears to be random, suggesting that it does not markedly affect the organ-specific clustering.

Regarding the outliers, there is some evidence of clustering among them in the three cell types. However, this may not necessarily indicate an anomaly, as all outliers are sourced from similar organs (Spleen and Liver), which naturally tend to group together in the analysis, regardless of their outlier status. However, in the epithelial clustering, there is one clear outlier for Spleen which does not cluster well with any of the other samples. Thus, as it has a poor unique mapping rate and does not seem to agree with the rest of our data, we choose to remove it from further downstream analysis.

[1] 56884    23
[1] 23  8

We did attempt to first perform batch correction using ComBat, but found that it had no effect on the quality of our clustering. Instead of using ComBat, we will account for the batch effect introduced by the replicates by incorporating it as a covariate in our differential expression analysis.

Differential expression analysis

Methods

For the differential expression, we want to compare three methods: anova test, DESeq2, and edgeR. The following table compares the three methods for differential expression analysis:

Method Description Key Features
ANOVA Test ANOVA (Analysis of Variance) is a statistical method used to compare means of two or more groups. In the context of gene expression, it can be used to determine if the expression of a gene is significantly different across different conditions or groups. - Suitable for comparing multiple groups.
- Assumes normal distribution and homogeneity of variances.
- Less suited for count data typical in RNA-seq.
DESeq2 DESeq2 is a statistical method tailored for RNA-seq data analysis. It’s designed to normalize count data and model variance for differential expression analysis. - Uses negative binomial distribution to model RNA-seq data.
- Adjusts for library size and other normalization factors.
- Provides methods for estimating variance and significance of changes.
edgeR edgeR is another popular tool for analyzing RNA-seq count data. Similar to DESeq2, it models count data using a negative binomial distribution but has some differences in its approach to normalization and variance estimation. - Effective for small sample sizes.
- Uses an overdispersed Poisson model to account for biological variability.
- Employs empirical Bayes methods for estimating dispersions.

Each method has its strengths and is best suited for different types of data sets and research objectives. The choice of method should be guided by the specific characteristics of the data and the research goals.

Run analysis

Anova test assumes a normal distribution across our data, which cannot be the case with expression data, but sometimes the approximation still works. Therefore, in the first place, we perform an anova test to then compare the results with two more advanced methods, namely edgeR and DESeq2.

# function to perform the differential expression test with ANOVA
ANOVA_test <- function(expr,
                    cond,
                    ctrl = NULL,
                    covar = NULL,
                    padj_method = p.adjust.methods){
    pval_fc <- data.frame(t(pbapply(expr, 1, function(e){
    dat <- data.frame(y = log1p(e),
                      cond = cond)
    if (! is.null(covar))
      dat <- data.frame(dat, covar)
    
    m1 <- lm(y ~ ., data = dat)
    m0 <- lm(y ~ . - cond, data = dat)
    test <- anova(m1, m0)
    pval <- test$Pr[2]
    
    avgs <- tapply(log1p(e), cond, mean)
    if (! is.null(ctrl) && sum(cond %in% ctrl) > 0){
      fc <- exp(max(avgs[names(avgs) != ctrl]) - avgs[ctrl])
    } else{
      fc <- exp(max(avgs) - min(avgs))
    }
    
    return(c(pval = unname(pval), fc = unname(fc)))
  })), row.names = rownames(expr))
  padj <- p.adjust(pval_fc$pval, method = padj_method)
  return(data.frame(pval_fc, "padj" = padj)[,c("pval","padj","fc")])
}
# pipeline for endothelial cells
#| warning: false
res_DE_endo <- ANOVA_test(expr = expr_endo[meta_genes$expressed_endo,],
                  cond = meta_endo$Organ,
                  covar = meta_endo$Replicate)

Since ANOVA assumes a normal distribution across data, we make use of the Negative Binomial (NB) distribution using edgeR or DESeq2. We can perform the edgeR and DESeq2 analyses to test if genes across organs are truly DE genes or the differential expression is just caused due to the different replicates, hence we model them with organs and replicate numbers combined.

reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
transcripts missing from tx2gene: 4332
summarizing abundance
summarizing counts
summarizing length
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 
transcripts missing from tx2gene: 4332
summarizing abundance
summarizing counts
summarizing length
reading in files with read_tsv
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 
transcripts missing from tx2gene: 4332
summarizing abundance
summarizing counts
summarizing length

For DESeq2, we use raw counts matrix as input (that is not normalized), and specify our design matrix as ~ Organ + Replicate, and then later reduce by Replicate when taking the likelihood ratio test.

# pipeline for endothelial cells
dds_endo <- DESeqDataSetFromTximport(txi_endo,
                                colData = meta_endo,
                                design = ~ Organ+Replicate)
dds_filtered_endo <- dds_endo[intersect(rownames(expr_endo)[meta_genes$expressed_endo], rownames(dds_endo)),]
dds_filtered_endo <- DESeq(dds_filtered_endo, test="LRT", reduced= ~ Replicate)
res_DESeq2_raw_endo <- results(dds_filtered_endo)
Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
design formula are characters, converting to factors

Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
design formula are characters, converting to factors

For edgeR, we also use the raw counts.

# pipeline for endothelial cells
grp_endo <- meta_endo$Organ
group_endo <- factor(grp_endo)

rep_endo <- meta_endo$Replicate
replicate_endo <- factor(rep_endo)

design_endo <- model.matrix(~ replicate_endo + group_endo)
cts_endo <- txi_endo$counts

normMat_endo <- txi_endo$length
normMat_endo <- normMat_endo/exp(rowMeans(log(normMat_endo)))
normCts_endo <- cts_endo/normMat_endo

y_endo <- DGEList(counts=cts_endo, group=group_endo)
y_endo <- calcNormFactors(y_endo)
y_endo <- scaleOffset(y_endo, normMat_endo)
y_endo <- estimateDisp(y_endo,design_endo)
fit_endo <- glmQLFit(y_endo,design_endo)

qlf_endo <- glmQLFTest(fit_endo, coef=4:10)
res_edgeR_endo <- topTags(qlf_endo, n=Inf, p.value=0.1)$table

We use quasi-likelihood F-test (QL) instead of the likelihood ratio test because it is better for DE analysis of RNA-bulk data for edgeR.

We can now compare the p-values obtained from the anova test, DESeq2, and edgeR.

Comparison in endothelial cells:

Warning: Removed 3137 rows containing missing values (`geom_point()`).
Warning: Removed 2263 rows containing missing values (`geom_point()`).
Warning: Removed 3137 rows containing missing values (`geom_point()`).

We see that the p-values are quite different between the different methods. As we want to be stringent with which genes we investigate, we proceed with only the differential expressed genes which were found using both DESeq2 and edgeR. We note that this is a conservative estimate.

ENDOTHELIAL CELLS RESULTS:

max_layer_DEG_endo
         Brain          Heart         Kidney LargeIntestine          Liver 
           387             16            171             55             85 
          Lung SmallIntestine         Spleen 
           163             88             56 

EPITHELIAL CELLS RESULTS:

max_layer_DEG_epi
         Brain          Heart         Kidney LargeIntestine          Liver 
           286             30             42            117             63 
          Lung SmallIntestine         Spleen 
            38             67             24 

FIBROBLASTS RESULTS:

max_layer_DEG_fibro
         Brain          Heart         Kidney LargeIntestine          Liver 
           173             44             68             30             49 
          Lung SmallIntestine         Spleen 
            40             35             54 

ATAC-seq analysis

Indexing processing

Sorting and indexing

ATAC-seq, short for Assay for Transposase-Accessible Chromatin using sequencing, is a powerful technique widely utilized in molecular biology. This method offers valuable insights into the chromatin accessibility of a genome. Chromatin accessibility is defined as the ease with which DNA-binding proteins, including transcription factors, can access and interact with specific regions of the DNA. When working with ATAC-seq, there can be the possibility to see the state of the chromatin, something not really possible when considering RNA-seq data.

To introduce this analysis, we briefly go through the main steps: 1. Alignment is the first activity when starting studying some ATAC-seq data. In this step, our files are transformed from SAM to BAM. In our analysis we obtain BAM files of all the three cell types after the alignment. 2. Sorting BAM files and obtaining the indexing of these files. Sorting a BAM file involves arranging the reads in a file based on their chromosomal coordinates. This is essential for efficient retrieval of specific reads or regions of interest. Without the sorting step, searching for specific reads would require scanning through the entire file, which can be time-consuming and computationally expensive. Indexing a sorted BAM file creates a small auxiliary index file that stores information about the position of reads in the sorted file. This index file enables efficient random access to specific reads or regions of interest without having to resort to scanning the entire file. After those activities, there is the possibility of continuing the computational pipeline obtaining BigWig files and calling mac2 callpeaks. However, since the indexing requires a lot of time to run, therefore we analysed intermediate steps where we load intermediate results and continue the pipeline, avoiding to run all the steps de novo.

outBAM_dir <- "ATACseq/atac_fibro_bam"

bam_files <- list.files(outBAM_dir, pattern = "\\.bam$", full.names = TRUE)

# Process each BAM file
for(file in bam_files) {
  # Generate a new filename for the sorted BAM file
  sortedBAM <- file.path(dirname(file), paste0("Sorted_", basename(file)))

  # Sort the BAM file
  sortBam(file, gsub("\\.bam$", "", basename(sortedBAM)))

  # Index the sorted BAM file
  indexBam(sortedBAM)
}

# This code is used for the Bigwig files
sortedBAM_files <- gsub("SRR", "Sorted_SRR", bam_files)
# In case of multiple files
openRegionBigWig <- gsub("\\.sam\\.bam", "_openRegions.bw", sortedBAM_files)

# loop over each file to process open regions and export to bigWig
for(i in seq_along(sortedBAM_files)) {
  # convert the 'open' reads to a GRanges object
  atacFragments_Open <- GRanges(atacReads_Open_list[[i]])
  
  # calculate coverage and export to BigWig format
  export.bw(coverage(atacFragments_Open), openRegionBigWig[i])
}

# this code is used for peak calling
output_dir <- "/Users/guidoputignano/Research/stat426/Project/Main/ATACseq/fibro_peaks"
genome_size <- "mm"  

# Assuming sortedBAM_files is your vector of sorted BAM file paths
for(i in seq_along(sortedBAM_files)) {
  # Extract the file name without the extension for naming in MACS2
  file_name <- tools::file_path_sans_ext(basename(sortedBAM_files[i]))
  
  # Construct the MACS2 command with both output and error redirected to the log file
  log_file <- paste(output_dir, "/", file_name, ".peak.log", sep="")
  macs2_command <- paste("macs2 callpeak -t", sortedBAM_files[i],
                         "--nomodel --shift -100 --extsize 200 --format BAM -g", genome_size,
                         "-n", file_name, "--outdir", output_dir,
                         "&>", log_file, "2>&1")
  
  # Print the command for debugging
  cat("Executing: ", macs2_command, "\n")
  
  # Run the MACS2 command using system()
  system(macs2_command)
}

Mapping

Depending on the cell type, there are some differences in the mapped regions. It is possible to use Rsubread’s propmapped to give us number of mapped and unmapped reads, as well as the mapping percentage.

                           NumTotal NumMapped PropMapped
Sorted_SRR11256132.sam.bam 29803943  25068048   0.841098
Sorted_SRR11256141.sam.bam 31968002  28083168   0.878477
Sorted_SRR11256142.sam.bam 35762434  30775413   0.860551
Sorted_SRR11256147.sam.bam 35263264  24978789   0.708352
Sorted_SRR11256148.sam.bam 36312991  24967149   0.687554
Sorted_SRR11256149.sam.bam 30197589  29340889   0.971630
Sorted_SRR11256155.sam.bam 25218027  22081869   0.875638
Sorted_SRR11256156.sam.bam 29112339  24550883   0.843315
Sorted_SRR11256161.sam.bam 42215289  22855868   0.541412
Sorted_SRR11256162.sam.bam 37862042  22554752   0.595709
Sorted_SRR11256163.sam.bam 23988934  14919581   0.621936
Sorted_SRR11256164.sam.bam 36096888  23835954   0.660333
Sorted_SRR11256169.sam.bam 22022067  19701760   0.894637
Sorted_SRR11256170.sam.bam 26985544  23218356   0.860400
Sorted_SRR11256185.sam.bam 34175861  29754122   0.870618
Sorted_SRR11256186.sam.bam 28223374  24335072   0.862231
Sorted_SRR11256199.sam.bam 28611200  26533234   0.927372
                           NumTotal NumMapped PropMapped
Sorted_SRR11256134.sam.bam 32272960  28915274   0.895960
Sorted_SRR11256135.sam.bam 32729942  31236532   0.954372
Sorted_SRR11256143.sam.bam 31821569  24660918   0.774975
Sorted_SRR11256144.sam.bam 20012557  18561408   0.927488
Sorted_SRR11256150.sam.bam 20936639  16020708   0.765200
Sorted_SRR11256151.sam.bam 23001938  17272338   0.750908
Sorted_SRR11256152.sam.bam 44497728  37177497   0.835492
Sorted_SRR11256157.sam.bam 25297847  18170335   0.718256
Sorted_SRR11256158.sam.bam 47336880  34607260   0.731085
Sorted_SRR11256165.sam.bam 32511208  21578171   0.663715
Sorted_SRR11256166.sam.bam 33400278  19292257   0.577608
Sorted_SRR11256171.sam.bam 28905397  24655072   0.852957
Sorted_SRR11256172.sam.bam 25085109  11113922   0.443049
Sorted_SRR11256187.sam.bam 23118274  19323418   0.835850
Sorted_SRR11256188.sam.bam 28824516  24525778   0.850865
Sorted_SRR11256189.sam.bam 36439451  32117618   0.881397
Sorted_SRR11256196.sam.bam 27304395  26539000   0.971968
Sorted_SRR11256206.sam.bam 28181973  20675618   0.733647
                           NumTotal NumMapped PropMapped
Sorted_SRR11256136.sam.bam 37487953  34950305   0.932308
Sorted_SRR11256137.sam.bam 46047718  42335509   0.919383
Sorted_SRR11256145.sam.bam 30270677  28506556   0.941722
Sorted_SRR11256146.sam.bam 45193814  41597951   0.920435
Sorted_SRR11256153.sam.bam 25228337  23369741   0.926329
Sorted_SRR11256159.sam.bam 24965279  22860470   0.915691
Sorted_SRR11256167.sam.bam 17401849  16013731   0.920232
Sorted_SRR11256168.sam.bam 25711420  24867010   0.967158
Sorted_SRR11256173.sam.bam 23342779  20685629   0.886168
Sorted_SRR11256174.sam.bam 27748692  24721918   0.890922
Sorted_SRR11256190.sam.bam 28247388  25961781   0.919086
Sorted_SRR11256191.sam.bam 18355217  17241130   0.939304
Sorted_SRR11256195.sam.bam 23565670  22611273   0.959501

These three tables describe the mapping of epithelial cells, endothelial cells, and fibroblasts respectively.

Metric Description
Mapped Reads Mapped Reads are the sequence reads that have been successfully aligned to a reference genome. A read is considered “mapped” if the alignment software finds a location in the reference genome where the read matches or closely matches, taking into account possible sequencing errors or genetic variations. Mapped reads are critical for most downstream analyses, such as variant calling, gene expression analysis, etc.
Unmapped Reads Unmapped Reads are the reads that could not be aligned to the reference genome. There could be several reasons for a read being unmapped, such as the read being too short, having poor quality, containing too many errors or ambiguities, or originating from a region not present in the reference genome (like a novel insert or a contaminant).
Mapping Percentage Mapping Percentage is the percentage of total reads that have been successfully mapped to the reference genome. It is calculated as the number of mapped reads divided by the total number of reads (mapped plus unmapped), multiplied by 100. This metric is crucial for assessing the quality of the sequencing data and the effectiveness of the alignment process.

In case there may be the interest of knowing more about the sorted files we can use some functions like:

# Run samtools flagstat for a quick summary of alignments
system(paste("samtools flagstat", sortedBAM))

Here we provide the description only in the case of the sample SRR11256152, since loop through all samples would be very time-consuming.

Category Count Description
Total Reads 44497728 + 0 QC-passed reads + QC-failed reads
Primary Reads 44497728 + 0 Primary reads
Secondary Reads 0 + 0 Secondary reads
Supplementary Reads 0 + 0 Supplementary reads
Duplicates 0 + 0 Duplicate reads
Primary Duplicates 0 + 0 Primary duplicate reads
Mapped 37177497 + 0 (83.55%) Mapped reads (Percentage Mapped)
Primary Mapped 37177497 + 0 (83.55%) Primary mapped reads (Percentage Mapped)
Paired in Sequencing 0 + 0 Paired reads in sequencing
Read1 0 + 0 First reads in pairs
Read2 0 + 0 Second reads in pairs
Properly Paired 0 + 0 Properly paired reads
With Itself and Mate Mapped 0 + 0 Reads with itself and mate mapped
Singletons 0 + 0 Singleton reads
With Mate Mapped to Different Chr 0 + 0 Reads with mate mapped to a different chromosome
With Mate Mapped to Different Chr (mapQ>=5) 0 + 0 Reads with mate mapped to a different chromosome with mapQ>=5

In the context of genomic data, identifiers like “GL456210.1,” “JH584295.1,” and “MU069434.1” represent specific sequences that are typically not part of the primary assembly of a genome. If you may want to compare the genome with a reference, you have to remove those elements, which are not of interest in our analysis, considering only the main chromosomal regions present in our read. Making use of the chromosomal regions, it is possible to see the sample was a male, as also seen from the Y chromosome. Moreover, we can note there is an evident decrease in the mapped value from the Chromosome 1 until reaching the last chromosomes. That is something normal when considering mice chromosomes.

Distribution of the mapped reads depending on the sequence

Post-alignment processing

Across lots of function for the alignment, we decided readGAlignments is the best function in our case. In general, there is also the possibility of using the function readGAlignmentPairs(), which is used when there are two reads (not our case). Considering the sample SRR11256152:

Metric Description
Total Reads There are 44,497,728 total reads, all of which are primary (none are secondary or supplementary).
Mapped Reads Out of the total reads, 37,177,497 are mapped, which corresponds to an 83.55% mapping rate. This indicates that the majority of your reads align to the reference genome.
Paired-End Reads The output indicates that there are 0 paired in sequencing, meaning there are no paired-end reads in your dataset.
Properly Paired Reads The count of 0 properly paired further confirms the absence of paired-end reads in the dataset.

When considering this analysis, the paper suggested several genes expressed depending on the cell type.

     Gene Tissue    CellType
1    Nppc  Heart Endothelium
2    Dner  Heart  Epithelium
3   Ptpn7 Kidney Endothelium
4     Mr1  Liver  Epithelium
5   Cxcr4  Liver  Epithelium
6     Hlx  Liver  Epithelium
7  Fcer1g  Liver Fibroblasts
8   Actr3  Liver Fibroblasts
9    F11r  Liver Fibroblasts
10  Ncstn  Liver Fibroblasts
11   Atf6 Spleen Endothelium
12  Arpc2 Spleen  Epithelium
13  Psmd1 Spleen  Epithelium
14   Atf6 Spleen  Epithelium
15  Sumo1 Spleen  Epithelium
16  Casp8 Spleen  Epithelium
17   Nppc Spleen Fibroblasts
18  Bend6 Spleen Fibroblasts

After obtaining the GRanges file, we can focus our understanding on the Chromosome 1 that has the majority of mapped regions. In this case, it may be interesting to study the length of the reads through the function width(atacReads), where each number represents the length of a read, and all the first six reads have a length of 51 nucleotides. This means each of these reads is 51 bases long.

In (plot-1?) there is the frequency of Read Mapping Quality (MapQ) Scores. It’s possible to notice that the quality tends to be high, except in some cases where it’s really low (identified as 1).

Distribution of frequency of Read Mapping Quality (MapQ) Scores taking the example of SRR11256152.

[1] 51 51 51 51 51 51
[1] 0 0 0 0 0 0

This result is expected, as our reads are single-end reads. If our reads were double-end reads, the result would have been different. Moreover, in this second case we should have used the function readLengths, instead of insertSizes, to see the length of our reads.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  44.00   51.00   51.00   50.99   51.00   57.00 

In the context of sequencing data, especially for single-end reads, the read length is typically consistent across all reads if the data comes from a sequencing run with a fixed read length setting. The provided output suggests that the sequencing data might have a uniform read length of 51 bases per read, which is common in many sequencing platforms where you specify a certain read length for the entire run. In any case, there are also some variation when considering the whole data.

Read length plot for epithelial cells (example) Read length plot for epithelial cells (example)

With ATAC-seq, we are interested in knowing both ends of the DNA fragments generated by the assay, since the ends indicate where the transposase inserted. This can be done only with paired-end reads. A single-end read defines only one end of the fragment, and, although computational methods to infer fragment lengths from single-end reads exist, they tend to be inaccurate. With single-end reads, there is only one position to compare, and so any reads whose 5’ ends match are considered duplicates. Thus, many false positives may result, and perfectly good reads are removed from further analysis

The gals is the form of a large list in this case.

Promoter transcript plot for epithelial cells

The PT score, or Promoter-Transcript score, is a metric used in genomic studies, particularly in the analysis of chromatin immunoprecipitation sequencing (ChIP-seq) data. The concept behind the PT score is to compare the level of a certain genomic feature (like protein binding, histone modifications, or other epigenetic marks) at promoter regions versus the rest of the transcript (transcript body).

NFR score plot for epithelial cells

The Nucleosome Free Regions (NFR) score is a metric used in genomics to identify and quantify regions in the genome that are devoid of nucleosomes. Nucleosomes are the basic units of DNA packaging in eukaryotic cells and consist of a segment of DNA wound around a core of histone proteins. Regions that are free of nucleosomes are often indicative of regulatory elements, such as promoters, enhancers, or other DNA-protein interaction sites, and are thus of great interest in the study of gene regulation and chromatin structure.

TSSE score plot for epithelial cells

TSS enrichment score is a ratio between aggregate distribution of reads centered on TSSs and that flanking the corresponding TSSs. TSS score = the depth of TSS (each 100bp window within 1000 bp each side) / the depth of end flanks (100bp each end). TSSE score = max(mean(TSS score in each window)). TSS enrichment score is calculated according to the definition at https://www.encodeproject.org/data-standards/terms/#enrichment. Transcription start site (TSS) enrichment values are dependent on the reference files used.

QC for low quality, duplicates and signal distribution

In this case, we can consider the low quality control, duplicates and signal distributions in several situations. We can take one example into account:

                       Sample   Reads Filt. RiP. RiBL.
1: Sorted_SRR11256132.sam.bam 1724877  12.7 11.3 1.620
2: Sorted_SRR11256141.sam.bam 2025523  23.0 10.1 0.582
3: Sorted_SRR11256142.sam.bam 2214846  24.2 13.4 0.613
Metric Description
Reads The total number of reads sequenced or processed.
Filt. Short for “Filtered.” Represents the number or percentage of reads filtered out due to poor quality or other criteria.
RiP. Possibly “Reads in Peaks.” Denotes the percentage of reads that fall into the identified peaks, indicative of regions of interest.
RiBL. Likely “Reads in BlackList Regions.” Indicates the percentage of reads mapping to blacklist regions, known to produce unreliable or artifact-prone signals.

In this case, there is just an example of some of elements of the data.

Metric Description
Mapped The total number of reads that were successfully aligned to the reference genome.
Dup_Percent The percentage of reads that are duplicates of other reads, which can be indicative of amplification bias or other issues in sequencing.
Genomic Feature Description
Promoter (<=1kb) Regions within 1 kilobase (kb) upstream of the transcription start site (TSS) of a gene. Promoters are crucial for gene regulation, serving as the binding site for RNA polymerase and other transcription factors.
Promoter (1-2kb) Regions located 1 to 2 kb upstream of the TSS.
Promoter (2-3kb) Regions located 2 to 3 kb upstream of the TSS. Promoter regions are typically involved in the initiation of transcription and can vary in distance from the actual TSS. Different ranges are specified to understand the distribution of peaks concerning the proximity to the TSS.
5’ UTR The untranslated region (UTR) immediately upstream of the coding sequence of a gene on the 5’ end. It’s involved in the regulation of translation and can affect the stability, localization, and efficiency of mRNA translation.
3’ UTR The untranslated region immediately downstream of the coding sequence of a gene on the 3’ end. Similar to the 5’ UTR, it plays roles in post-transcriptional regulation, affecting mRNA stability and regulation of translation.
1st Exon The first exon of a gene, which includes both the translated and untranslated regions. It is significant as it often contains both regulatory elements and the beginning of the coding sequence.
Other Exon Refers to all exons in a gene except for the first. Exons are the parts of the gene sequence that are expressed into mRNA and then translated into protein.
1st Intron The first intron of a gene, which is the non-coding sequence immediately following the first exon. Introns are spliced out during mRNA processing but can contain important regulatory elements and splice sites.
Other Intron Refers to all introns in a gene except for the first. Introns can vary greatly in size and can be involved in gene regulation.
Downstream (<=3kb) Regions located within 3 kb downstream of the transcription end site (TES) of a gene. These areas can contain regulatory elements that influence gene transcription termination, mRNA stability, or other post-transcriptional controls.
Distal Intergenic Regions that are far from known genes, typically not falling within any of the specified promoter, intronic, or exonic regions. These areas might be involved in long-range regulation or may contain as-yet-uncharacterized genes or regulatory elements.

When analysing the three cell types, we performed several analysis, among them, we wanted to understand the annotation of the genomic data for chromosome 1 (chr1) generated by the MACS (Model-based Analysis of ChIP-Seq) tool. In this case, we noticed that the cells had really similar concentration of Promoter, UTR, Introns and all the other components. For this reason, we are displaying a plot to show the distribution in the three cases. This case in particular describes the epithelial cells, but it can be generalised to all the other cell types.

GRanges object with 810 ranges and 9 metadata columns:
        seqnames              ranges strand |       annotation   geneChr
           <Rle>           <IRanges>  <Rle> |      <character> <integer>
    [1]     chr1     4877903-4878119      * | Promoter (<=1kb)         1
    [2]     chr1     4927718-4928073      * | Promoter (<=1kb)         1
    [3]     chr1     6284937-6285205      * | Promoter (<=1kb)         1
    [4]     chr1     7158790-7159439      * | Promoter (<=1kb)         1
    [5]     chr1     9615343-9615687      * | Promoter (<=1kb)         1
    ...      ...                 ...    ... .              ...       ...
  [806]     chr1 194621056-194621287      * | Promoter (<=1kb)         1
  [807]     chr1 194658452-194659371      * | Promoter (<=1kb)         1
  [808]     chr1 194774808-194775008      * | Promoter (<=1kb)         1
  [809]     chr1 194813227-194813567      * | Promoter (<=1kb)         1
  [810]     chr1 194813771-194814075      * | Promoter (<=1kb)         1
        geneStart   geneEnd geneLength geneStrand      geneId transcriptId
        <integer> <integer>  <integer>  <integer> <character>  <character>
    [1]   4878046   4916958      38913          1       18777 NM_001355712
    [2]   4927917   4968132      40216          1       21399 NM_001159750
    [3]   6284886   6346328      61443          1       12421    NM_009826
    [4]   7159144   7243852      84709          1      319263    NM_183028
    [5]   9615633   9617680       2048          1       59014    NM_021511
    ...       ...       ...        ...        ...         ...          ...
  [806] 194621129 194643600      22472          1       12490 NM_001111059
  [807] 194638067 194659267      21201          2      320400    NR_033564
  [808] 194723546 194774556      51011          2       17221    NR_132621
  [809] 194781019 194813878      32860          2       12946 NM_001355060
  [810] 194781019 194813878      32860          2       12946 NM_001355060
        distanceToTSS
            <numeric>
    [1]             0
    [2]             0
    [3]            51
    [4]             0
    [5]             0
    ...           ...
  [806]             0
  [807]             0
  [808]          -252
  [809]           311
  [810]             0
  -------
  seqinfo: 1 sequence from mm39 genome

After performing the analysis, there was the possibility of reviewing some genetic expression and how they would related to the use of Integrative Genome Viewer (IGV).The Integrative Genomics Viewer (IGV) is a high-performance genome browser used for visualizing and exploring large-scale genomic datasets. It provides researchers and scientists with a powerful tool to interactively view and analyze diverse genomic data. Obtaining our BigWig and bed files, there was the possibility of analysing them using IGV or the UCSC Genome Browser. The UCSC Genome Browser is a web-based tool that allows researchers to visualize and explore genomic data. It provides a user-friendly interface for viewing annotated genome sequences and various genomic annotations, making it a valuable resource for researchers in genomics, genetics, and related fields. In this case, it’s possible to note that there are .bed files that have been useful to identify the location of the peaks, while BigWig files is particularly useful to identify the signal intensities from high-throughput sequencing experiments. For example, when considering the relationship between BigWig and bed files, it’s possible to notice that .bed files would show the presence of peaks, but for the quantificiation BigWig files can be helpful:

In any case, .bed files can be helpful when alaysing how many peaks can be present in a particular cell type. For example, when analysing endothelial cells, Mixi file wasn’t highly expressed, and the quantity of peaks in bed files was really close to 0. At the same time, when considering F11r, Actr3, Fcer1g as examples, they were immune genes expressed in the case of fibroblasts, but some of them also had an expression in other cells, such as in endothelial cells.

Conclusions

The conclusion can be divided looking either at The RNA-seq analysis, or at the ATAC-seq analysis. Our analysis mostly focused on the evaluation of unrealized Potential in Promoter Accessibility and the Similarity Across Structural Cells within the Same Organ.

RNA-seq conclusions

In the case of RNA-seq, after performing our analysis we analysed the DE in each organ noticing that genes expression shows huge variability across different organs, which indicates there are highly specific patterns of expression, defining that cells tend to be similar organs compared to cell types. In this case, we used different tools to avoid a bias related to the analysis used. For this reason, we noticed that the statement on the similarity across structural cells within the same organ was correct.

ATAC-seq conclusions

In the case of ATAC-seq, we wanted to verify that the conclusion present in the case of RNA-seq was also correct. After the analysis of some genes such as F11r, Actr3, Fcer1g, there was an understanding that the cell type may not be highly influential compared to the organ in which the cells are. Indeed, some genes that could be influential in the case of fibroblasts, are also present in endothelial cells. Moreover, the higher level of similarity between organs can also be noticed with the similarities in the case of the genomic data for chromosome 1 (chr1) and the PT score, Nucleosome Free Regions (NFR) score, and TSS enrichment score, whose results were quite similar among cell types. When looking at unrealised potentials, it was also possible to notice the presence of some of those immune genes whose chromatin was not open in the case of steady state scenario. For this reason, also the first hypothesis was correct.

Resources

In order to achieve the goal of our project, we made use of various heterogeneous resources:

  1. The paper “Structural cells are key regulators of organ-specific immune responses” by Krausgruber et al.;
  2. SRA Run Selector to obtain the RNA-seq and the ATAC-seq data;
  3. Euler VIII, which stands for Erweiterbarer, Umweltfreundlicher, Leistungsfähiger ETH-Rechner (https://scicomp.ethz.ch/wiki/Euler);
  4. Ensembl.org to retrieve additional metadata for our analysis;
  5. RStudio to conduct our experimental analyses;
  6. Integrative Genome Viewer (IGV) for the visualisation of bed files and bwg files;
  7. Google Drive to store 70 GB of data for our analysis, they are possible to be found here;
  8. Online tutorials were followed for RNA-seq and ATAC-seq.

References

Gomes, Tomás, and Sarah A. Teichmann. 2020. “An Antiviral Response Beyond Immune Cells.” Nature 583 (7815): 206–7. https://doi.org/10.1038/d41586-020-01916-2.
Krausgruber, Thomas, Nikolaus Fortelny, Victoria Fife-Gernedl, Martin Senekowitsch, Linda C. Schuster, Alexander Lercher, Amelie Nemc, et al. 2020. “Structural Cells Are Key Regulators of Organ-Specific Immune Responses.” Nature 583 (7815): 296–302. https://doi.org/10.1038/s41586-020-2424-4.
Minton, Kirsty. 2020. “A Gene Atlas of Structural Immunity.” Nature Reviews Immunology 20 (9): 518–19. https://doi.org/10.1038/s41577-020-0398-y.
Tuong, Zewen Kelvin, and Menna R. Clatworthy. 2020. “Organ Immune Responses Dont Forget the Structural Cells.” Nature Reviews Nephrology 16 (10): 570–71. https://doi.org/10.1038/s41581-020-00334-x.